Abstract

Distributional models that learn rich semantic word representations
are a success story of recent NLP research. However, developing
models that learn useful representations of phrases and sentences
has proved far harder.

We propose using the definitions found in everyday dictionaries as a
means of bridging this gap between lexical and phrasal semantics.
Neural language embedding models can be effectively trained to map
dictionary definitions (phrases) to (lexical) representations of the
words defined by those definitions. We present two applications of
these architectures: reverse dictionaries that return the name of a
concept given a definition or description, and general-knowledge
crossword question answerers. On both tasks, neural language
embedding models trained on definitions from a handful of
freely-available lexical resources perform as well as or better than
existing commercial systems that rely on significant task-specific
engineering. The results highlight the effectiveness of both neural
embedding architectures and definition-based training for developing
models that understand phrases and sentences.

1 Introduction 

Much recent research in computational semantics has focussed on
learning representations of arbitrary-length phrases and sentences.
This task is challenging partly because there is no obvious gold
standard of phrasal representation that could be used in training
and evaluation. Consequently, it is difficult to design approaches
that could learn from such a gold standard, and also hard to
evaluate or compare different models.

* Work mainly done at the University of Montreal.

In this work, we use dictionary definitions to address this issue.
The composed meaning of the words in a dictionary definition (a
tall, long-necked, spotted ruminant of Africa) should correspond to
the meaning of the word they define (giraffe). This bridge between
lexical and phrasal semantics is useful because high quality vector
representations of single words can be used as a target when
learning to combine the words into a coherent phrasal
representation.

This approach still requires a model capable of learning to map
between arbitrary-length phrases and fixed-length continuous-valued
word vectors. For this purpose we experiment with two broad classes
of neural language models (NLMs): Recurrent Neural Networks (RNNs),
which naturally encode the order of input words, and simpler
(feed-forward) bag-of-words (BOW) embedding models. Prior to
training these NLMs, we learn target lexical representations by
training the Word2Vec software (Mikolov et al., 2013) on billions of
words of raw text.

We demonstrate the usefulness of our approach by building and
releasing two applications. The first is a reverse dictionary or
concept finder: a system that returns words based on user
descriptions or definitions (Zock and Bilac, 2004). Reverse
dictionaries are used by copywriters, novelists, translators and
other professional writers to find words for notions or ideas that
might be on the tip of their tongue. For instance, a travel-writer
might look to enhance her prose by searching for examples of a
country that people associate with warm weather or an activity that
is mentally or physically demanding. We show that an NLM-based
reverse dictionary trained on only a handful of dictionaries
identifies novel definitions and concept descriptions comparably to
or better than commercial systems, which rely on significant
task-specific engineering and access to much more dictionary data.
Moreover, by exploiting models that learn bilingual word
representations (Vulic et al., 2011; Klementiev et al., 2012;
Hermann and Blunsom, 2013; Gouws et al., 2014), we show that the NLM
approach can be easily extended to produce a potentially useful
cross-lingual reverse dictionary.

The second application of our models is as a general-knowledge
crossword question answerer. When trained on both dictionary
definitions and the opening sentences of Wikipedia articles, NLMs
produce plausible answers to (non-cryptic) crossword clues, even
those that apparently require detailed world knowledge. Both BOW and
RNN models can outperform bespoke commercial crossword solvers,
particularly when clues contain a greater number of words.
Qualitative analysis reveals that NLMs can learn to relate concepts
that are not directly connected in the training data and can thus
generalise well to unseen input. To facilitate further research, all
of our code, training and evaluation sets (together with a system
demo) are published online with this paper.^1

2 Neural Language Model Architectures 

The first model we apply to the dictionary-based learning task is a
recurrent neural network (RNN). RNNs operate on variable-length
sequences of inputs; in our case, natural language definitions,
descriptions or sentences. RNNs (with LSTMs) have achieved
state-of-the-art performance in language modelling (Mikolov et al.,
2010) and image caption generation (Kiros et al., 2015), and
approach state-of-the-art performance in machine translation
(Bahdanau et al., 2015).

During training, the input to the RNN is a dictionary definition or
sentence from an encyclopedia. The objective of the model is to map
these defining phrases or sentences to an embedding of the word that
the definition defines. The target word embeddings are learned
independently of the RNN weights, using the Word2Vec software
(Mikolov et al., 2013).

^1 https://www.cl.cam.ac.uk/~fh295/

The set of all words in the training data constitutes the vocabulary
of the RNN. For each word in this vocabulary we randomly initialise
a real-valued vector (input embedding) of model parameters. The RNN
‘reads’ the first word in the input by applying a non-linear
projection of its embedding $v_1$ parameterised by input weight
matrix $W$ and $b$, a vector of biases,

$$A_1 = \phi(W v_1 + b)$$

yielding the first internal activation state $A_1$. In our
implementation, we use $\phi(x) = \tanh(x)$, though in theory $\phi$
can be any differentiable non-linear function. Subsequent internal
activations (after time-step $t$) are computed by projecting the
embedding of the $t$-th word and using this information to ‘update’
the internal activation state:

$$A_t = \phi(U A_{t-1} + W v_t + b).$$

As such, the values of the final internal activation state units
$A_N$ are a weighted function of all input word embeddings, and
constitute a ‘summary’ of the information in the sentence.
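
The recurrence above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released implementation: the function name `rnn_encode`, the toy dimensions and the random parameters are assumptions made for the example.

```python
import numpy as np

def rnn_encode(embeddings, W, U, b):
    """Return the final activation A_N for a sequence of input embeddings."""
    # First word: A_1 = tanh(W v_1 + b); there is no previous state yet.
    A = np.tanh(W @ embeddings[0] + b)
    # Later words: A_t = tanh(U A_{t-1} + W v_t + b).
    for v in embeddings[1:]:
        A = np.tanh(U @ A + W @ v + b)
    return A

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 3
W = rng.normal(size=(hid_dim, emb_dim))
U = rng.normal(size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)
definition = rng.normal(size=(5, emb_dim))  # five stand-in word embeddings
summary = rnn_encode(definition, W, U, b)
```

Because each state feeds into the next through `U`, reversing the order of `definition` generally changes `summary`, which is the sense in which the RNN encodes word order.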

2.1 Long Short Term Memory 

A known limitation when training RNNs to read language using
gradient descent is that the error signal (gradient) on the training
examples either vanishes or explodes as the number of time steps
(sentence length) increases (Bengio et al., 1994). Consequently,
after reading longer sentences the final internal activation $A_N$
typically retains useful information about the most recently read
(sentence-final) words, but can neglect important information near
the start of the input sentence. LSTMs (Hochreiter and Schmidhuber,
1997) were designed to mitigate this long-term dependency problem.

At each time step $t$, in place of the single internal layer of
units $A$, the LSTM RNN computes six internal layers $i^w$, $g^i$,
$g^f$, $g^o$, $h$ and $m$. The first, $i^w$, represents the core
information passed to the LSTM unit by the latest input word at $t$.
It is computed as a simple linear projection of the input embedding
$v_t$ (by input weights $W_w$) and the output state of the LSTM at
the previous time step $h_{t-1}$ (by update weights $U_w$):

$$i^w_t = W_w v_t + U_w h_{t-1} + b_w$$

The layers $g^i$, $g^f$ and $g^o$ are computed as weighted sigmoid
functions of the input embeddings, again parameterised by
layer-specific weight matrices $W$ and $U$:

$$g^s_t = \frac{1}{1 + \exp(-(W_s v_t + U_s h_{t-1} + b_s))}$$

where $s$ stands for one of $i$, $f$ or $o$. These vectors take
values on [0, 1] and are often referred to as gating activations.
Finally, the internal memory state $m_t$ and new output state $h_t$
of the LSTM at $t$ are computed as

$$m_t = i^w_t \odot g^i_t + m_{t-1} \odot g^f_t$$
$$h_t = g^o_t \odot \phi(m_t),$$

where $\odot$ indicates elementwise vector multiplication and $\phi$
is, as before, some non-linear function (we use tanh). Thus, $g^i$
determines to what extent the new input word is considered at each
time step, $g^f$ determines to what extent the existing state of the
internal memory is retained or forgotten in computing the new
internal memory, and $g^o$ determines how much this memory is
considered when computing the output state at $t$.

The sentence-final memory state of the LSTM, $m_N$, a ‘summary’ of
all the information in the sentence, is then projected via an extra
non-linear projection (parameterised by a further weight matrix) to
a target embedding space. This layer enables the target (defined)
word embedding space to take a different dimension to the activation
layers of the RNN, and in principle enables a more complex
definition-reading function to be learned.
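
A minimal NumPy sketch of one LSTM time step and the final projection, following the layer definitions in this section. The helper names (`lstm_step`, `lstm_encode`), the dictionary layout of the parameters, the bias `c` and the use of tanh for the final projection are assumptions made for illustration; the text only states that the final projection is parameterised by a further weight matrix.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, m_prev, params):
    """One time step; params maps 'w', 'i', 'f', 'o' to (W, U, b) triples."""
    def affine(name):
        W, U, b = params[name]
        return W @ v_t + U @ h_prev + b

    i_w = affine("w")                # core input information (linear)
    g_i = sigmoid(affine("i"))       # input gate
    g_f = sigmoid(affine("f"))       # forget gate
    g_o = sigmoid(affine("o"))       # output gate
    m_t = i_w * g_i + m_prev * g_f   # new internal memory state
    h_t = g_o * np.tanh(m_t)         # new output state
    return h_t, m_t

def lstm_encode(embeddings, params, P, c):
    """Read a whole definition, then project the final memory state m_N."""
    hid_dim = P.shape[1]
    h, m = np.zeros(hid_dim), np.zeros(hid_dim)
    for v in embeddings:
        h, m = lstm_step(v, h, m, params)
    # Assumed form of the extra projection to the target embedding space.
    return np.tanh(P @ m + c)

# Toy sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
emb_dim, hid_dim, out_dim = 4, 3, 2
params = {s: (rng.normal(size=(hid_dim, emb_dim)),
              rng.normal(size=(hid_dim, hid_dim)),
              np.zeros(hid_dim)) for s in "wifo"}
P = rng.normal(size=(out_dim, hid_dim))
c = np.zeros(out_dim)
definition = rng.normal(size=(5, emb_dim))
encoded = lstm_encode(definition, params, P, c)
```

Note that the output dimension (`out_dim`) is independent of the hidden dimension, which is the property of the projection layer described above.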

2.2 Bag-of-Words NLMs 

We implement a simpler linear bag-of-words (BOW) architecture for
encoding the definition phrases. As with the RNN, this architecture
learns an embedding $v_i$ for each word in the model vocabulary,
together with a single matrix of input projection weights $W$. The
BOW model simply maps an input definition with word embeddings
$v_1 \ldots v_n$ to the sum of the projected embeddings
$\sum_{i=1}^{n} W v_i$. This model can also be considered a special
case of an RNN in which the update function $U$ and nonlinearity
$\phi$ are both the identity, so that ‘reading’ the next word in the
input phrase updates the current representation more simply:

$$A_t = A_{t-1} + W v_t.$$
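
The equivalence between the summed form and the recurrent form can be checked with a short NumPy sketch. The function names and toy dimensions below are illustrative assumptions, not part of the released code.

```python
import numpy as np

def bow_encode(embeddings, W):
    """Sum of projected embeddings: sum_i W v_i."""
    return (embeddings @ W.T).sum(axis=0)

def bow_encode_recurrent(embeddings, W):
    """Same result via the update A_t = A_{t-1} + W v_t."""
    A = np.zeros(W.shape[0])
    for v in embeddings:
        A = A + W @ v
    return A

# Toy sizes and random values, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
definition = rng.normal(size=(5, 4))
shuffled = definition[rng.permutation(5)]

same_as_recurrent = np.allclose(bow_encode(definition, W),
                                bow_encode_recurrent(definition, W))
order_invariant = np.allclose(bow_encode(definition, W),
                              bow_encode(shuffled, W))
```

Both flags evaluate to true: the sum does not depend on the order of the words, so the BOW output is unchanged when the definition is permuted.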

2.3 Pre-trained Input Representations 

We experiment with variants of these models in which the input
definition embeddings are pre-learned and fixed (rather than
randomly-initialised and updated) during training. There are several
potential advantages to taking this approach. First, the word
embeddings are trained on massive corpora and may therefore
introduce additional linguistic or conceptual knowledge to the
models. Second, at test time, the models will have a larger
effective vocabulary, since the pre-trained word embeddings
typically span a larger vocabulary than the union of all dictionary
definitions used to train the model. Finally, the models will then
map to and from the same space of embeddings (the embedding space
will be closed under the operation of the model), so conceivably
could be more easily applied as a general-purpose ‘composition
engine’.

2.4 Training Objective 

We train all neural language models $M$ to map the input definition
phrase $s_c$ defining word $c$ to a location close to the
pre-trained embedding $v_c$ of $c$. We experiment with two different
cost functions for the word-phrase pair $(c, s_c)$ from the training
data. The first is simply the cosine distance between $M(s_c)$ and
$v_c$. The second is the rank loss

$$\max(0, m - \cos(M(s_c), v_c) + \cos(M(s_c), v_r))$$

where $v_r$ is the embedding of a randomly-selected word from the
vocabulary other than $c$. This loss function was used for language
models, for example, in (Huang et al., 2012). In all experiments we
apply a margin $m = 0.1$, which has been shown to work well on
word-retrieval tasks (Bordes et al., 2015).
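
As a rough illustration, the two cost functions can be written for a single training pair as below. The rank loss is given in the usual margin form, in which similarity to the randomly-selected word is penalised; the function names and the two-dimensional toy vectors are assumptions made for the example.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_loss(pred, target):
    """Cosine distance between the model output M(s_c) and target v_c."""
    return 1.0 - cosine(pred, target)

def rank_loss(pred, target, random_word, margin=0.1):
    """Zero once the target is closer than the random word by the margin."""
    return max(0.0, margin - cosine(pred, target) + cosine(pred, random_word))

# Toy vectors, for illustration only.
v_c = np.array([1.0, 0.0])   # target word embedding
v_r = np.array([0.0, 1.0])   # embedding of a randomly-selected other word
good = np.array([2.0, 0.0])  # output pointing at the target
bad = np.array([0.0, 3.0])   # output pointing at the random word

losses = {
    "good": (cosine_loss(good, v_c), rank_loss(good, v_c, v_r)),
    "bad": (cosine_loss(bad, v_c), rank_loss(bad, v_c, v_r)),
}
```

For the `good` output both losses are zero, while the `bad` output incurs a cosine loss of 1 and a rank loss of 1.1 (the margin plus the full similarity to the wrong word).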



2.5 Implementation Details

Since training on the dictionary data took 6-10 hours, we did not
conduct a hyper-parameter search on any validation sets over the
space of possible model configurations such as embedding dimension,
or size of hidden layers. Instead, we chose these parameters to be
as standard as possible based on previous research. For fair
comparison, any aspects of model design that are not specific to a
particular class of model were kept constant across experiments.

The pre-trained word embeddings used in all of our models (either as
input or target) were learned by a continuous bag-of-words (CBOW)
model using the Word2Vec software on approximately 8 billion words
of running text.^2 When training such models on massive corpora,
large embedding lengths of up to 700 have been shown to yield best
performance (see e.g. (Faruqui et al., 2014)). The pre-trained
embeddings used in our models were of length 500, as a compromise
between quality and memory constraints.

In cases where the word embeddings are learned during training on
the dictionary objective, we make these embeddings shorter (256),
since they must be learned from much less language data. In the RNN
models, at each time step each of the four LSTM RNN internal layers
(gating and activation states) had length 512, another standard
choice (see e.g. (Cho et al., 2014)). The final hidden state was
mapped linearly to length 500, the dimension of the target
embedding. In the BOW models, the projection matrix projects input
embeddings (either learned, of length 256, or pre-trained, of length
500) to length 500 for summing.

All models were implemented with Theano (Bergstra et al., 2010) and
trained with minibatch SGD on GPUs. The batch size was fixed at 16
and the learning rate was controlled by Adadelta (Zeiler, 2012).

^2 The Word2Vec embedding models are well known; further details can
be found at https://code.google.com/p/word2vec/ The training data
for this pre-training was compiled from various online text sources
using the script demo-train-big-model-v1.sh from the same page.


3 Reverse Dictionaries


The most immediate application of our trained models is as a reverse
dictionary or concept finder. It is simple to look up a definition
in a dictionary given a word, but professional writers often also
require suitable words for a given idea, concept or definition.^3
Reverse dictionaries satisfy this need by returning candidate words
given a phrase, description or definition. For instance, when
queried with the phrase an activity that requires strength and
determination, the OneLook.com reverse dictionary returns the
concepts exercise and work. Our trained RNN model can perform a
similar function, simply by mapping a phrase to a point in the
target (Word2Vec) embedding space, and returning the words
corresponding to the embeddings that are closest to that point.
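
The lookup step described above amounts to a nearest-neighbour search by cosine similarity. A minimal sketch, assuming the phrase has already been mapped to a vector by a trained model; the three-word vocabulary and its two-dimensional vectors are invented for illustration and are not real Word2Vec embeddings.

```python
import numpy as np

def nearest_words(query_vec, vocab, embeddings, k=5):
    """Return the k words whose embeddings are closest (cosine) to query_vec."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = unit @ q                      # cosine similarity to every word
    top = np.argsort(-sims)[:k]          # indices of the k most similar words
    return [vocab[i] for i in top]

# Hypothetical toy embedding space.
vocab = ["exercise", "work", "sleep"]
embeddings = np.array([[1.0, 0.1],
                       [0.8, 0.3],
                       [-1.0, 0.2]])
# Stand-in for the model output M(s) for some input phrase s.
query_vec = np.array([2.0, 0.2])
candidates = nearest_words(query_vec, vocab, embeddings, k=2)
```

Here `candidates` is `["exercise", "work"]`, since the query vector is parallel to the toy embedding of exercise and next closest to that of work.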

Several other academic studies have proposed reverse dictionary
models. These generally rely on common techniques from information
retrieval, comparing definitions in their internal database to the
input query, and returning the word whose definition is ‘closest’ to
that query (Bilac et al., 2003; Bilac et al., 2004; Zock and Bilac,
2004). Proximity is quantified differently in each case, but is
generally a function of hand-engineered features of the two
sentences. For instance, Shaw et al. (2013) propose a method in
which the candidates for a given input query are all words in the
model’s database whose definitions contain one or more words from
the query. This candidate list is then ranked according to a
query-definition similarity metric based on the hypernym and hyponym
relations in WordNet, features commonly used in IR such as tf-idf
and a parser.

There are, in addition, at least two commercial online reverse
dictionary applications, whose architecture is proprietary
knowledge. The first is the Dictionary.com reverse dictionary,^4
which retrieves candidate words from the Dictionary.com dictionary
based on user definitions or descriptions. The second is
OneLook.com, whose algorithm searches 1061 indexed dictionaries,
including all major freely-available online dictionaries and
resources such as Wikipedia and WordNet.

^3 See the testimony from professional writers at
http://www.onelook.com/?c=awards

^4 Available at http://dictionary.reference.com/reverse/

3.1 Data Collection and Training

To compile a bank of dictionary definitions for training the model,
we started with all words in the target embedding space. For each of
these words, we extracted dictionary-style definitions from five
electronic resources: WordNet, The American Heritage Dictionary, The
Collaborative International Dictionary of English, Wiktionary and
Webster’s. We chose these five dictionaries because they are
freely-available via the WordNik API,^5 but in theory any dictionary
could be chosen. Most words in our training data had multiple
definitions. For each word $w$ with definitions
$\{d_1 \ldots d_n\}$ we included all pairs
$(w, d_1) \ldots (w, d_n)$ as training examples.

To allow models access to more factual knowledge than might be
present in a dictionary (for instance, information about specific
entities, places or people), we supplemented this training data with
information extracted from Simple Wikipedia.^6 For every word in the
model’s target embedding space that is also the title of a Wikipedia
article, we treat the sentences in the first paragraph of the
article as if they were (independent) definitions of that word. When
a word in Wikipedia also occurs in one (or more) of the five
training dictionaries, we simply add these pseudo-definitions to the
training set of definitions for the word. Combining Wikipedia and
dictionaries in this way resulted in approximately 900,000
word-‘definition’ pairs of approximately 100,000 unique words.

To explore the effect of the quantity of training data on the
performance of the models, we also trained models on subsets of this
data. The first subset comprised only definitions from WordNet
(approximately 150,000 definitions of 75,000 words). The second
subset comprised only words in WordNet and their first definitions
(approximately 75,000 word, definition pairs).^7 For all variants of
RNN and BOW models, however, reducing the training data in this way
resulted in a clear reduction in performance on all tasks. For
brevity, we therefore do not present these results in what follows.

^5 See http://developer.wordnik.com

^6 https://simple.wikipedia.org/wiki/Main_Page

^7 As with other dictionaries, the first definition in WordNet
generally corresponds to the most typical or common sense of a word.

3.2 Comparisons

As a baseline, we also implemented two entirely unsupervised methods
using the neural (Word2Vec) embeddings from the target word space.
In the first (W2V add), we compose the embeddings for each word in
the input query by pointwise addition, and return as candidates the
nearest word embeddings to the resulting composed vector.^8 The
second baseline (W2V mult) is identical except that the embeddings
are composed by elementwise multiplication. Both methods are
established ways of building phrase representations from word
embeddings (Mitchell and Lapata, 2010).

None of the models or evaluations from previous academic research on
reverse dictionaries is publicly available, so direct comparison is
not possible. However, we do compare performance with the commercial
systems. The Dictionary.com system returned no candidates for over
96% of our input definitions. We therefore conduct detailed
comparison with OneLook.com, which is the first reverse dictionary
tool returned by a Google search and seems to be the most popular
among writers.

^8 Since we retrieve all answers from embedding spaces by cosine
similarity, addition of word embeddings is equivalent to taking the
mean.

3.3 Reverse Dictionary Evaluation

To our knowledge there are no established means of measuring reverse
dictionary performance. In the academic research on English reverse
dictionaries that we are aware of, evaluation was conducted on 300
word-definition pairs written by lexicographers (Shaw et al., 2013).
Since these are not publicly available we developed new evaluation
sets and make them freely available for future evaluations.

The evaluation items are of three types, designed to test different
properties of the models. To create the seen evaluation, we randomly
selected 500 words from the WordNet training data (seen by all
models), and then randomly selected a definition for each word.
Testing models on the resulting 500 word-definition pairs assesses
their ability to recall or decode previously encoded information.
For the unseen evaluation, we randomly selected 500 words from
WordNet and excluded all definitions of these words from the
training data of all models.

                     Dictionary definitions                          Concept descriptions
                     Seen (500 WN defs)      Unseen (500 WN defs)    (200)
Model                MR    Acc      RV       MR    Acc      RV       MR    Acc      RV
Unsup. models
  W2V add            -     -        -        923   .04/.16  163      339   .07/.30  150
  W2V mult           -     -        -        1000  .00/.00           1000  .00/.00  27*
OneLook              0     .89/.91  67       -     -        -        18.5  .38/.58  153
NLMs
  RNN cosine         12    .48/.73  103      22    .41/.70  116      69    .28/.54  157
  RNN w2v cosine     19    .44/.70  111      19    .44/.69  126      26    .38/.66  111
  RNN ranking        18    .45/.67  128      24    .43/.69  103      25    .34/.66  102
  RNN w2v ranking    54    .32/.56  155      33    .36/.65  137      30    .33/.69  77
  BOW cosine         22    .44/.65  129      19    .43/.69  103      50    .34/.60  99
  BOW w2v cosine     15    .46/.71  124      14    .46/.71  104      28    .36/.66  99
  BOW ranking        17    .45/.68  115      22    .42/.70  95       32    .35/.69  101
  BOW w2v ranking    55    .32/.56  155      36    .35/.66  138      38    .33/.72  85

MR = median rank; Acc = accuracy@10/100; RV = rank variance.

Table 1: Performance of different reverse dictionary models in
different evaluation settings. *Low variance in mult models is due
to consistently poor scores, so not highlighted.
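
The three statistics reported in Table 1 can be computed from the rank that a model assigns to the correct word for each test item. The sketch below is only one plausible reading: it assumes 1-based ranks and takes ‘rank variance’ literally as the variance of the ranks, whereas the exact conventions used to produce the table (rank indexing, any cap on ranks, variance versus standard deviation) are not specified here. The function name and the example ranks are invented.

```python
import numpy as np

def summarise(ranks):
    """Median rank, accuracy@10/100 and rank variance for a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "median_rank": float(np.median(ranks)),      # lower is better
        "acc@10": float(np.mean(ranks <= 10)),       # higher is better
        "acc@100": float(np.mean(ranks <= 100)),     # higher is better
        "rank_variance": float(np.var(ranks)),       # lower is better
    }

# Hypothetical ranks of the correct word for five test items.
stats = summarise([1, 3, 12, 150, 40])
```

For these five made-up items the median rank is 12, two of five correct words fall in the top 10 (0.4) and four of five in the top 100 (0.8).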

Finally, for a fair comparison with OneLook, which has both the seen
and unseen pairs in its internal database, we built a new dataset of
concept descriptions that do not appear in the training data for any
model. To do so, we randomly selected 200 adjectives, nouns or verbs
from among the top 3000 most frequent tokens in the British National
Corpus (Leech et al., 1994) (but outside the top 100). We then asked
ten native English speakers to write a single-sentence ‘description’
of these words. To ensure the resulting descriptions were good
quality, for each description we asked two participants who did not
produce that description to list any words that fitted the
description (up to a maximum of three). If the target word was not
produced by one of the two checkers, the original participant was
asked to re-write the description until the validation was
passed.^9 These concept descriptions, together with other evaluation
sets, can be downloaded from our website for future comparisons.

Test set               Word    Description
Dictionary definition  valve   "control consisting of a mechanical
                               device for controlling fluid flow"
Concept description    prefer  "when you like one thing more than
                               another thing"

Table 2: Style difference between dictionary definitions and concept
descriptions in the evaluation.

^9 Re-writing was required in 6 of the 200 cases.

Given a test description, definition, or question, all models
produce a ranking of possible word answers based on the proximity of
their representations of the input phrase and all possible output
words. To quantify the quality of a given ranking, we report three
statistics: the median rank of the correct answer (over the whole
test set, lower better), the proportion of test cases in which the
correct answer appears in the top 10/100 in this ranking
(accuracy@10/100, higher better) and the variance of the rank of the
correct answer across the test set (rank variance, lower better).

3.4 Results

Table 1 shows the performance of the different models in the three
evaluation settings. Of the unsupervised composition models,
elementwise addition is clearly more effective than multiplication,
which almost never returns the correct word as the nearest neighbour
of the composition. Overall, however, the supervised models (RNN,
BOW and OneLook) clearly outperform these baselines.

The results indicate interesting differences between the NLMs and
the OneLook dictionary search engine. The Seen (WN first)
definitions in Table 1 occur in both the training data for the NLMs
and the lookup data for the OneLook model. Clearly the OneLook
algorithm is better than NLMs at retrieving already available
information (returning 89% of correct words among the top-ten
candidates on this set). However, this is likely to come at the cost
of a greater memory footprint, since the model requires access to
its database of dictionaries at query time.^10

The performance of the NLM embedding models on the (unseen) concept
descriptions task shows that these models can generalise well to
novel, unseen queries. While the median rank for OneLook on this
evaluation is lower, the NLMs retrieve the correct answer in the top
ten candidates approximately as frequently, within the top 100
candidates more frequently and with lower variance in ranking over
the test set. Thus, NLMs seem to generalise more ‘consistently’ than
OneLook on this dataset, in that they generally assign a reasonably
high ranking to the correct word. In contrast, as can also be
verified by querying our web demo, OneLook tends to perform either
very well or poorly on a given query.^11

When comparing between NLMs, perhaps the most striking observation
is that the RNN models do not significantly outperform the BOW
models, even though the BOW model output is invariant to changes in
the order of words in the definition. Users of the online demo can
verify that the BOW models recover concepts from descriptions
strikingly well, even when the words in the description are
permuted. This observation underlines the importance of lexical
semantics in the interpretation of language by NLMs, and is
consistent with some other recent work on embedding sentences (Iyyer
et al., 2015).

It is difficult to observe clear trends in the differences between
NLMs that learn input word embeddings and those with pre-trained
(Word2Vec) input embeddings. Both types of input yield good
performance in some situations and weaker performance in others. In
general, pre-training input embeddings seems to help most on the
concept descriptions, which are furthest from the training data in
terms of linguistic style. This is perhaps unsurprising, since
models that learn input embeddings from the dictionary data acquire
all of their conceptual knowledge from this data (and thus may
overfit to this setting), whereas models with pre-trained embeddings
have some semantic memory acquired from general running-text
language data and other knowledge acquired from the dictionaries.

^10 The trained neural language models are approximately half the
size of the six training dictionaries stored as plain text, so would
be hundreds of times smaller than the OneLook database of 1061
dictionaries if stored this way.

^11 We also observed that the mean ranking for NLMs was lower than
for OneLook on the concept descriptions task.

3.5 Qualitative Analysis 

Some example output from the various models is presented in Table 3.
The differences illustrated here are also evident from querying the
web demo. The first example shows how the NLMs (BOW and RNN)
generalise beyond their training data. Four of the top five
responses could be classed as appropriate in that they refer to
inhabitants of cold countries. However, inspecting the WordNik
training data, there is no mention of cold or anything to do with
climate in the definitions of Eskimo, Scandinavian, Scandinavia etc.
Therefore, the embedding models must have learned that coldness is a
characteristic of Scandinavia, Siberia, Russia, relates to Eskimos
etc. via connections with other concepts that are described or
defined as cold. In contrast, the candidates produced by the OneLook
and (unsupervised) W2V baseline models have nothing to do with
coldness.

The second example demonstrates how the NLMs generally return
candidates whose linguistic or conceptual function is appropriate to
the query. For a query referring explicitly to a means, method or
process, the RNN and BOW models produce verbs in different forms or
an appropriate deverbal noun. In contrast, OneLook returns words of
all types (aerodynamics, draught) that are arbitrarily related to
the words in the query. A similar effect is apparent in the third
example. While the candidates produced by the OneLook model are the
correct part of speech (Noun), and related to the query topic, they
are not semantically appropriate. The dictionary embedding models
are the only ones that return a list of plausible habits, the class
of noun requested by the input.

3.6 Cross-Lingual Reverse Dictionaries 

We now show how the RNN architecture can be eas¬ 
ily modified to create a bilingual reverse dictionary 
- a system that returns candidate words in one lan¬ 
guage given a description or definition in another. 
A bilingual reverse dictionary could have clear ap¬ 
plications for translators or transcribers. Indeed, the 
problem of attaching appropriate words to concepts 
may be more common when searching for words in 



Input 

Description 

OneLook 

W2V add 

RNN 

BOW 

”a native of 

a cold 
country” 

I'.country 2:citizen 
3:foreign A'.naturalize 
5:cisco 

l:a 2.fhe 
3'.another A'.of 
5'.whole 

I'.eskimo 2:scandinavian 
3'.arctic A'.indian 

S'.siberian 

I'.frigid 2:cold 

3:icy A'.russian 
S'.indian 

”a way of 
moving 
through 
the air” 

I'.drag 2:whiz 
^'.aerodynamics A'.draught 
5:coefficient of drag 

I'.the 2:through 
3:a A'.moving 
5:in 

1 '.glide 2:scooting 
3:glides A'.gliding 

5'.flight 

1 '.flying 2:gliding 
3:glide A'.fly 
S'.scooting 

”a habit that 
might annoy 
your spouse” 

1 '.sisterinlaw 2:fatherinlaw 
3:motherinlaw A'.stepson 
5:stepchild 

I'.annoy 2:your 
3:might A'.that 
S'.either 

I'.bossiness 2:jealousy 
3:annoyance A'.rudeness 
S'.boorishness 

{'.infidelity 2:bossiness 
3:foible A'.unfaithfulness 
S'.adulterous 


Table 3: The top-five candidates for example queries (invented by the authors) from different reverse dictionary mod¬ 
els. Both the RNN and BOW models are without Word2Vec input and use the cosine loss. 


Input description 

RNN EN-FR 

W2V add 

RNN + Google 

”an emotion that you might feel 

triste, pitoyable 

insister, effectivement 

sentiment, regretter 

after being rejected” 

repugnante, epouvantable 

pourquoi, nous 

peur, aversion 

”a small black hying insect that 

mouche, canard 

attentivement, pouvions 

voler, faucon 

transmits disease and likes horses” 

hirondelle, pigeon 

pourrons, naturellement 

mouches, volant 


Table 4: Responses from cross-lingual reverse dictionary
models to selected queries. Underlined responses are
'correct' or potentially useful for a native French speaker.

a second language than in a monolingual context.

To create the bilingual variant, we simply
replace the Word2Vec target embeddings with
those from a bilingual embedding space. Bilingual
embedding models use bilingual corpora
to learn a space of representations of the words
in two languages, such that words from either
language that have similar meanings are
close together (Hermann and Blunsom, 2013;
Chandar et al., 2014; Gouws et al., 2014). For
a test-of-concept experiment, we used English-French
embeddings learned by the state-of-the-art
BilBOWA model (Gouws et al., 2014) from the
Wikipedia (monolingual) and Europarl (bilingual)
corpora. We trained the RNN model to map
from English definitions to English words in the
bilingual space. At test time, after reading an
English definition, we then simply return the nearest
French word neighbours to that definition.

^The approach should work with any bilingual embeddings.
We thank Stephan Gouws for doing the training.

Because no benchmarks exist for quantitative
evaluation of bilingual reverse dictionaries, we compare
this approach qualitatively with two alternative
methods for mapping definitions to words across
languages. The first is analogous to the W2V Add
model of the previous section: in the bilingual
embedding space, we first compose the embeddings of
the English words in the query definition with
elementwise addition, and then return the French word
whose embedding is nearest to this vector sum. The
second uses the RNN monolingual reverse dictionary
model to identify an English word from an English
definition, and then translates that word using
Google Translate.
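
The retrieval step shared by the embedding-space methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the three-dimensional vectors, the EN and FR dictionaries and all function names are invented for the example, whereas a real system would load embeddings from a bilingual model such as BilBOWA. The query vector is built here by elementwise addition, as in the W2V Add baseline; the RNN variant would substitute the vector produced by the trained network and keep the same nearest-neighbour search.

```python
import math

# Toy bilingual embedding space (hypothetical values, for illustration only).
EN = {
    "small": [0.1, 0.2, 0.0],
    "flying": [0.0, 0.9, 0.1],
    "insect": [0.1, 0.8, 0.3],
}
FR = {
    "mouche": [0.1, 0.9, 0.2],
    "triste": [0.9, 0.0, 0.1],
    "pourquoi": [0.3, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def add_compose(words, table):
    """Sum the embeddings of the known words (unknown words are skipped)."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for word in words:
        if word in table:
            total = [t + x for t, x in zip(total, table[word])]
    return total


def nearest_french(query_vec, k=2):
    """Return the k French words whose embeddings are closest to query_vec."""
    ranked = sorted(FR, key=lambda w: cosine(query_vec, FR[w]), reverse=True)
    return ranked[:k]


query = add_compose(["small", "flying", "insect"], EN)
print(nearest_french(query, k=1))
```

Because the English and French words live in one space, no word-to-word translation table is consulted at query time; only the candidate vocabulary is restricted to French.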

Table 4 shows that the RNN model can be effectively
modified to create a cross-lingual reverse
dictionary. It is perhaps unsurprising that the W2V
Add model candidates are generally the lowest in
quality given the performance of the method in the
monolingual setting. In comparing the two RNN-based
methods, the RNN (embedding space) model
appears to have two advantages over the RNN +
Google approach. First, it does not require online
access to a bilingual word-word mapping as
defined e.g. by Google Translate. Second, it is less
prone to errors caused by word sense ambiguity.
For example, in response to the query an emotion
you feel after being rejected, the bilingual embedding
RNN returns emotions or adjectives describing
mental states. In contrast, the monolingual + Google
model incorrectly maps the plausible English response
regret to the verbal infinitive regretter. The
model makes the same error when responding to a
description of a fly, returning the verb voler (to fly).

3.7 Discussion 

We have shown that simply training RNN or BOW
NLMs on six dictionaries yields a reverse dictionary
that performs comparably to the leading commercial
system, even with access to much less dictionary
data. Indeed, the embedding models consistently
return syntactically and semantically plausible
responses, which are generally part of a more
coherent and homogeneous set of candidates than
those produced by the commercial systems. We also
showed how the architecture can be easily extended
to produce bilingual versions of the same model.

In the analyses performed thus far, we only test
the dictionary embedding approach on tasks that it
was trained to accomplish (mapping definitions or
descriptions to words). In the next section, we
explore whether the knowledge learned by dictionary
embedding models can be effectively transferred to
a novel task.

4 General Knowledge (crossword) Question Answering

The automatic answering of questions posed in natural
language is a central problem of Artificial
Intelligence. Although web search and IR techniques
provide a means to find sites or documents related to
language queries, at present, internet users requiring
a specific fact must still sift through pages to locate
the desired information.

Systems that attempt to overcome this, via
fully open-domain or general knowledge
question-answering (open QA), generally require large
teams of researchers, modular design and powerful
infrastructure, exemplified by IBM's Watson
(Ferrucci et al., 2010). For this reason, much
academic research focuses on settings in which
the scope of the task is reduced. This has
been achieved by restricting questions to a specific
topic or domain (Molla and Vicedo, 2007),
allowing systems access to pre-specified passages
of text from which the answer can be inferred
(Iyyer et al., 2014; Weston et al., 2015), or
centering both questions and answers on a particular
knowledge base (Berant and Liang, 2014;
Bordes et al., 2014).

In what follows, we show that the dictionary
embedding models introduced in the previous sections
may form a useful component of an open QA system.
Given the absence of a knowledge base or
web-scale information in our architecture, we narrow
the scope of the task by focusing on general
knowledge crossword questions. General knowledge
(non-cryptic, or quick) crosswords appear in
national newspapers in many countries. Crossword
question answering is more tractable than general
open QA for two reasons. First, models know the
length of the correct answer (in letters), reducing
the search space. Second, some crossword questions
mirror definitions, in that they refer to fundamental
properties of concepts (a twelve-sided shape) or
request a category member (a city in Egypt).

4.1 Evaluation 

General Knowledge crossword questions come in
different styles and forms. We used the Eddie James
crossword website to compile a bank of sentence-like
general-knowledge questions. Eddie James is
one of the UK's leading crossword compilers, working
for several national newspapers. Our long question
set consists of the first 150 questions (starting
from puzzle #1) from his general-knowledge crosswords,
excluding clues of fewer than four words
and those whose answer was not a single word (e.g.
kingjames).

To evaluate models on a different type of clue, we
also compiled a set of shorter questions based on
the Guardian Quick Crossword. Guardian questions
still require general factual or linguistic knowledge,
but are generally shorter and somewhat more cryptic
than the longer Eddie James clues. We again formed
a list of 150 questions, beginning on 1 January 2015
and excluding any questions with multiple-word
answers. For clear contrast, we excluded those few
questions of length greater than four words. Of these
150 clues, a subset of 30 were single-word clues.
All evaluation datasets are available online with the
paper.

^As our interest is in the language understanding, we
do not address the question of fitting answers into a grid,
which is the main concern of end-to-end automated crossword
solvers (Littman et al., 2002).

^http://www.eddiejames.co.uk/

As with the reverse dictionary experiments, candidates
are extracted from models by inputting definitions
and returning words corresponding to the
closest embeddings in the target space. In this case,
however, we only consider candidate words whose
length matches the length specified in the clue.
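
The length constraint amounts to a simple post-filter on a ranked candidate list. The sketch below is illustrative only: the helper names and the ranked list are hypothetical, and it assumes that the answer length appears as a trailing "(n)" marker, as in the example clues shown in this paper.

```python
import re


def split_clue(clue):
    """Separate the clue text from its trailing '(n)' answer-length marker."""
    match = re.search(r"\((\d+)\)\s*$", clue)
    if match is None:
        raise ValueError("clue has no trailing length marker")
    return clue[:match.start()].strip(), int(match.group(1))


def filter_by_length(ranked_candidates, answer_length):
    """Keep candidates with the required letter count, preserving rank order."""
    return [w for w in ranked_candidates if len(w) == answer_length]


text, length = split_clue("Swiss mountain peak famed for its north face (5)")

# Hypothetical ranked output of an embedding model before filtering.
ranked = ["Matterhorn", "Eiger", "Alps", "Aosta", "Jungfrau", "Tyrol"]
print(filter_by_length(ranked, length))
```

Filtering after ranking keeps the model itself unchanged, so the same trained network serves both the reverse dictionary and the crossword setting.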


Test set           Word         Description

Long (150)         Baudelaire   "French poet and key figure in the development of Symbolism."
Short (120)        satanist     "devil devotee"
Single-Word (30)   guilt        "culpability"


Table 5: Examples of the different question types in the
crossword question evaluation dataset.


4.2 Benchmarks and Comparisons 

As with the reverse dictionary experiments, we
compare RNN and BOW NLMs with a simple
unsupervised baseline of elementwise addition of
Word2Vec vectors in the embedding space (we
discard the ineffective W2V mult baseline), again
restricting candidates to words of the pre-specified
length. We also compare to two bespoke online
crossword-solving engines. The first, One Across
(http://www.oneacross.com/), is the candidate
generation module of the award-winning
Proverb crossword system (Littman et al., 2002).
Proverb, which was produced by academic researchers,
has featured in national media such
as New Scientist, and beaten expert humans
in crossword solving tournaments. The second
comparison is with Crossword Maestro
(http://www.crosswordmaestro.com/), a
commercial crossword solving system that handles
both cryptic and non-cryptic crossword clues (we
focus only on the non-cryptic setting), and has also
been featured in national media. We are unable
to compare against a third well-known automatic
crossword solver, Dr Fill (Ginsberg, 2011), because
code for Dr Fill's candidate-generation module
is not readily available. As with the RNN and
baseline models, when evaluating existing systems
we discard candidates whose length does not match
the length specified in the clue.

Certain principles connect the design of the existing
commercial systems and differentiate them
from our approach. Unlike the NLMs, they each
require query-time access to large databases containing
common crossword clues, dictionary definitions,
the frequency with which words typically appear
as crossword solutions and other hand-engineered
and task-specific components (Littman et al., 2002;
Ginsberg, 2011).

4.3 Results 

The performance of models on the various question
types is presented in Table 6. When evaluating the
two commercial systems, One Across and Crossword
Maestro, we have access to web interfaces that
return up to approximately 100 candidates for each
query, so can only reliably record membership of the
top ten (accuracy@10).

On the long questions, we observe a clear advantage
for all dictionary embedding models over the
commercial systems and the simple unsupervised
baseline. Here, the best performing NLM (RNN
with Word2Vec input embeddings and ranking loss)
ranks the correct answer third on average, and in the
top-ten candidates over 60% of the time.
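
For concreteness, the rank-based measures can be computed from a list holding the rank of the correct answer for each question. The sketch below is an illustration rather than the evaluation code used here: the function names and the example ranks are invented, and it assumes that the average rank is the median, which this section does not state explicitly. The rank variance measure is left out because its exact definition is not given in this section.

```python
import statistics


def accuracy_at_k(ranks, k):
    """Fraction of questions whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def summarise(ranks):
    """Summary statistics over 1-based ranks of the correct answers."""
    return {
        "median_rank": statistics.median(ranks),  # assumed form of the average
        "accuracy@10": accuracy_at_k(ranks, 10),
        "accuracy@100": accuracy_at_k(ranks, 100),
    }


# Hypothetical ranks of the correct answer for eight questions.
ranks = [1, 3, 2, 15, 250, 7, 1, 40]
print(summarise(ranks))
```

A system that exposes only its top candidates through a web interface supplies enough information for accuracy@10, but not for a median over the full vocabulary.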

As the questions get shorter, the advantage of
the embedding models diminishes. Both the unsupervised
baseline and One Across answer the short
questions with comparable accuracy to the RNN and
BOW models. One reason for this may be the difference
in form and style between the shorter clues and
the full definitions or encyclopedia sentences in the
dictionary training data. As the length of the clue
decreases, finding the answer often reduces to generating
synonyms (culpability - guilt), or category members
(tall animal - giraffe). The commercial systems
can retrieve good candidates for such clues among
their databases of entities, relationships and common
crossword answers. Unsupervised Word2Vec
representations are also known to encode these sorts
of relationships (even after elementwise addition for
short sequences of words) (Mikolov et al., 2013).
This would also explain why the dictionary embedding
models with pre-trained (Word2Vec) input embeddings
outperform those with learned embeddings,
particularly for the shortest questions.

^See e.g. http://www.theguardian.com/crosswords


Each cell: avg rank - accuracy@10/100 - rank variance

Question Type        Long (150)           Short (120)           Single-Word (30)

One Across               .39/                  .68/                  .70/
Crossword Maestro        .27/                  .43/                  .73/
W2V add              42  .31/.63   92      11  .50/.78   66       2  .79/.90   45
RNN cosine           15  .43/.69  108      l2  .39/.61  117     l2“  .31/.52  187
RNN w2v cosine        4  .61/.82   60       7  .561.19   60      12  .48/.72  116
RNN ranking           6  .58/.84   48      10  .51/.73   57      12  .48/.69   67
RNN w2v ranking       3  .62/.80   61       8  .511.12,  49      12  .48/.69  114
BOW cosine            4  .60/.82   54       7  .561.12   51      12  .45/.72  137
BOW w2v cosine        4  .60/.83   56       7  .54/.80   48       3  .591.19  111
BOW ranking           5  .62/.87   50       8  .58/.83   37       8  .551.19   39
BOW w2v ranking       5  .60/.86   48       8  .56/.83   35       4  .551.23   43


Table 6: Performance of different models on crossword questions of different length. The two commercial systems
are evaluated via their web interface so only accuracy@10 can be reported in those cases.

4.4 Qualitative Analysis 

A better understanding of how the different models
arrive at their answers can be gained from considering
specific examples, as presented in Table 7. The
first three examples show that, despite the apparently
superficial nature of their training data (definitions and
introductory sentences), embedding models can answer
questions that require factual knowledge about
people and places. Another notable characteristic of
these models is the consistent semantic appropriateness
of the candidate set. In the first case, the top
five candidates are all mountains, valleys or places in
the Alps; in the second, they are all biblical names.
In the third, the RNN model retrieves currencies, in
this case performing better than the BOW model,
which retrieves entities of various types associated
with the Netherlands. Generally speaking (as can
be observed by the web demo), the 'smoothness' or
consistency in candidate generation of the dictionary
embedding models is greater than that of the commercial
systems. Despite its simplicity, the unsupervised
W2V addition method is at times also surprisingly
effective, as shown by the fact that it returns
Joshua in its top candidates for the second query.

The final example in Table 7 illustrates the surprising
power of the BOW model. In the training
data there is a single definition for the correct answer
Schoenberg: United States composer and musical
theorist (born in Austria) who developed atonal
composition. The only word common to both the
query and the definition is 'composer' (there is no
tokenization that allows the BOW model to directly
connect atonal and atonality). Nevertheless, the
model is able to infer the necessary connections between
the concepts in the query and the definition to
return Schoenberg as the top candidate.
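
The lexical-overlap observation can be checked directly. The sketch below uses a simple lowercase alphanumeric tokenizer, which is an assumption made for illustration; the tokenization actually applied to the training data is not described in this section.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query = "Arnold, 20th Century composer pioneer of atonality"
definition = ("United States composer and musical theorist "
              "(born in Austria) who developed atonal composition")

# Words that a bag-of-words model sees in both the clue and the definition.
shared = tokens(query) & tokens(definition)
print(shared)
```

Since 'atonal' and 'atonality' are distinct tokens, any link between them has to come from the learned word embeddings rather than from surface matching.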

Despite such cases, it remains an open question
whether, with more diverse training data,
the world knowledge required for full open QA
(e.g. secondary facts about Schoenberg, such
as his family) could be encoded and retained as
weights in a (larger) dynamic network, or whether
it will be necessary to combine the RNN with
an external memory that is less frequently (or
never) updated. This latter approach has begun to
achieve impressive results on certain QA and entailment
tasks (Bordes et al., 2014; Graves et al., 2014;
Weston et al., 2015).

Input description: "Swiss mountain peak famed for its north face (5)"
  One Across          1:noted 2:front 3:Eiger 4:crown 5:fount
  Crossword Maestro   1:after 2:favor 3:ahead 4:along 5:being
  BOW                 1:Eiger 2:Crags 3:Teton 4:Cerro 5:Jebel
  RNN                 1:Eiger 2:Aosta 3:Cuneo 4:Lecco 5:Tyrol

Input description: "Old Testament successor to Moses (6)"
  One Across          1:Joshua 2:Exodus 3:Hebrew 4:person 5:across
  Crossword Maestro   1:devise 2:Daniel 3:Haggai 4:Isaiah 5:Joseph
  BOW                 1:Isaiah 2:Elijah 3:Joshua 4:Elisha 5:Yahweh
  RNN                 1:Joshua 2:Isaiah 3:Gideon 4:Elijah 5:Yahweh

Input description: "The former currency of the Netherlands (7)"
  One Across          1:Holland 2:general 3:Lesotho
  Crossword Maestro   1:Holland 2:ancient 3:earlier 4:onetime 5:quondam
  BOW                 1:Guilder 2:Holland 3:Drenthe 4:Utrecht 5:Naarden
  RNN                 1:Guilder 2:Escudos 3:Pesetas 4:Someren 5:Florins

Input description: "Arnold, 20th Century composer pioneer of atonality (10)"
  One Across          1:surrealism 2:laborparty 3:tonemusics 4:introduced 5:Schoenberg
  Crossword Maestro   1:disharmony 2:dissonance 3:bringabout 4:constitute 5:triggeroff
  BOW                 1:Schoenberg 2:Christleib 3:Stravinsky 4:Elderfield 5:Mendelsohn
  RNN                 1:Mendelsohn 2:Williamson 3:Huddleston 4:Mandelbaum 5:Zimmerman


Table 7: Responses from different models to example crossword clues. In each case the model output is filtered to
exclude any candidates that are not of the same length as the correct answer. BOW and RNN models are trained
without Word2Vec input embeddings and with the cosine loss.


5 Conclusion

Dictionaries exist in many of the world's languages.
We have shown how these lexical resources can
constitute valuable data for training the latest neural
language models to interpret and represent the meaning
of phrases and sentences. While humans use
the phrasal definitions in dictionaries to better
understand the meaning of words, machines can use
the words to better understand the phrases. We used
two dictionary embedding architectures - a recurrent
neural network architecture with a long short-term
memory, and a simpler linear bag-of-words model -
to explicitly exploit this idea.

On the reverse dictionary task that mirrors their
training setting, NLMs that embed all known concepts
in a continuous-valued vector space perform
comparably to the best known commercial applications
despite having access to many fewer definitions.
Moreover, they generate smoother sets of candidates
and require no linguistic pre-processing or
task-specific engineering. We also showed how the
description-to-word objective can be used to train
models useful for other tasks. NLMs trained on the
same data can answer general-knowledge crossword
questions, and indeed outperform commercial systems
on questions containing more than four words.
While our QA experiments focused on crosswords,
the results suggest that a similar embedding-based
approach may ultimately lead to improved output
from more general QA and dialog systems and
information retrieval engines in general.


We make all code, training data, evaluation sets
and both of our linguistic tools publicly available
online for future research. In particular, we propose the
reverse dictionary task as a comparatively general-purpose
and objective way of evaluating how well
models compose lexical meaning into phrase or sentence
representations (whether or not they involve
training on definitions directly).


In the next stage of this research, we will explore
ways to enhance the NLMs described here,
especially in the question-answering context. The
models are currently not trained on any question-like
language, and would conceivably improve on
exposure to such linguistic forms. We would also
like to understand better how BOW models can perform
so well with no 'awareness' of word order,
and whether there are specific linguistic contexts in
which models like RNNs or others with the power
to encode word order are indeed necessary. Finally,
we intend to explore ways to endow the model with
richer world knowledge. This may require the
integration of an external memory module, similar to
the promising approaches proposed in several recent
papers (Graves et al., 2014; Weston et al., 2015).